MBON Acoustic Indices Study
Biostats Review — Methods & Preliminary Results
2025-12-12
Research Question
Can acoustic indices predict biological community metrics in estuarine environments?
- Location: 3 stations, May River, South Carolina
- Period: 2021 (full year)
- Responses: 9 community metrics
  - Fish: activity, richness, presence
  - Dolphins: echolocation, burst pulse, whistle, total activity, presence
  - Vessels: presence
- Predictors: ~60 acoustic indices (candidates)
Data Overview
- 13,102 observations (2-hour temporal bins)
- 4 data sources aligned to common resolution:
| Source | Description |
|---|---|
| Detections | Manual annotations of fish/dolphin/vessel presence |
| Environment | Temperature, depth (sensor data) |
| Acoustic Indices | ~60 indices across 5 categories |
| SPL | Sound pressure levels |
- Index categories: Amplitude, Complexity, Diversity, Spectral, Temporal
- Temporal structure: station / month / day / hour
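As a rough illustration of the alignment step, the sketch below floors timestamps to 2-hour bins and inner-joins sources on (station, bin). It is not the project's pipeline: the function names, the tuple layout, and the choice to average values within a bin are all assumptions made for the example.

```python
from datetime import datetime

BIN_HOURS = 2  # bin width from the slides; everything else here is illustrative


def bin_start(ts, bin_hours=BIN_HOURS):
    """Floor a timestamp to the start of its bin (bins are anchored at midnight)."""
    floored_hour = (ts.hour // bin_hours) * bin_hours
    return ts.replace(hour=floored_hour, minute=0, second=0, microsecond=0)


def align_sources(sources):
    """Align several sources on (station, bin start).

    `sources` maps a source name to a list of (station, timestamp, value)
    tuples. Values that fall in the same bin are averaged (an assumption),
    and only bins present in every source are kept (an inner join).
    """
    per_source = {}
    for name, records in sources.items():
        sums = {}
        for station, ts, value in records:
            key = (station, bin_start(ts))
            total, count = sums.get(key, (0.0, 0))
            sums[key] = (total + value, count + 1)
        per_source[name] = {k: total / count for k, (total, count) in sums.items()}

    common = set.intersection(*(set(d) for d in per_source.values()))
    return {k: {name: per_source[name][k] for name in per_source}
            for k in sorted(common)}


# Hypothetical station label and readings, for illustration only.
example = {
    "temperature": [("A", datetime(2021, 5, 1, 2, 5), 22.0),
                    ("A", datetime(2021, 5, 1, 3, 15), 20.0)],
    "spl": [("A", datetime(2021, 5, 1, 2, 30), 100.0),
            ("A", datetime(2021, 5, 1, 6, 0), 90.0)],
}
aligned = align_sources(example)
```

Here the 06:00 SPL reading is dropped because no temperature record falls in that bin. Presence-type annotations would likely need a different aggregation (for example, any detection in the bin) than the mean used here.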
Pipeline Overview
Stage 00: Data Alignment (4 sources → 2-hour bins)
↓
Stage 01: Index Reduction (60 → 14 via correlation/VIF)
↓
Stage 02-03: Response Variables + Feature Engineering
↓
Stage 05: GAMM modeling (mgcv::bam)
Model choice: GAMM (Generalized Additive Mixed Model)
- Allows non-linear (smooth) relationships between predictors and response
- Increasingly common in ecological literature for this type of study
- A preliminary comparison with a GLMM strongly favored the GAMM (ΔAIC > 6000)
Model Specifications
GAMM (mgcv::bam)
- Smooth terms (k=5) for indices & covariates
- Cyclic splines for hour, day-of-year
- Random effects: station, month
- AR1 residual autocorrelation via the rho parameter
Model types:
- Negative binomial (NB2 parameterization: nbinom2 in glmmTMB, nb() in mgcv) for count responses
- Binomial for presence/absence responses
Question 1: Index Reduction
Is our approach appropriate? Should we reduce further?
Index Reduction: What We Did
Step 1: Correlation pruning
- Removed one index from each pair with |r| > 0.6
- Result: 60 → 17 indices
Step 2: VIF screening
- Iteratively removed indices with VIF > 2
- Result: 17 → 14 indices
Outcome:
- 14 indices retained
- All 5 categories preserved (Amplitude, Complexity, Diversity, Spectral, Temporal)
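The two reduction steps can be sketched in a few lines. This is a minimal illustration, not the project's code: the slides do not say which member of a correlated pair was dropped, so the sketch assumes a greedy rule that keeps the earlier column. VIF is computed as the diagonal of the inverse correlation matrix, which is equivalent to 1 / (1 - R²) from regressing each index on the others.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_prune(columns, threshold=0.6):
    """Keep a column only if |r| <= threshold with every column kept so far.

    Assumption: the earlier column of a correlated pair wins.
    """
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("correlation matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(columns):
    """VIF per column: diagonal of the inverse correlation matrix."""
    names = list(columns)
    corr = [[pearson(columns[a], columns[b]) for b in names] for a in names]
    inverse = invert(corr)
    return {name: inverse[i][i] for i, name in enumerate(names)}


def vif_screen(columns, threshold=2.0):
    """Iteratively drop the column with the largest VIF until all are <= threshold."""
    remaining = dict(columns)
    while len(remaining) > 1:
        scores = vif(remaining)
        worst = max(scores, key=scores.get)
        if scores[worst] <= threshold:
            break
        del remaining[worst]
    return list(remaining)


# Toy columns (not study data): b is nearly collinear with a.
toy = {"a": [1, 2, 3, 4, 5, 6], "b": [2, 4, 6, 8, 10, 13], "c": [1, -1, 1, -1, 1, -1]}
after_correlation = correlation_prune(toy)
after_vif = vif_screen(toy)
```

Because the greedy rule depends on column order, a different ordering can retain a different set of indices; that sensitivity is worth checking before reporting the final 14.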
Index Reduction: Concerns
Are 14 indices too many?
Is correlation + VIF the right approach?
- Alternatives: PCA, LASSO, elastic net (others?)
Model shrinkage removed 4 more
- GAMM with select=TRUE shrunk ADI, BioEnergy, EPS_KURT, and MEANt to ~zero
- Should we report 14 predictors or 10 “effective” predictors?
- Should we formalize this as a two-stage approach? Remove the “shrunk” indices and rerun?
Index Reduction: Questions for Discussion
Q1: Is correlation + VIF standard practice, or is there a more appropriate approach?
Q2: Given that GAMM shrinkage removed 4 indices, should we adopt a two-stage approach (VIF → model-based selection)?
Q3: Are 14 predictors reasonable for ~13K observations, given that the effective sample size is lower due to temporal autocorrelation?
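To put a rough number on the effective sample size in Q3, one textbook approximation for an AR(1) process is n_eff ≈ n(1 − ρ)/(1 + ρ). The sketch below applies it to the 13,102 observations with the fixed rho = 0.6 used in the model. It treats the data as one evenly spaced series and uses the residual AR1 parameter as a stand-in for the autocorrelation, so the result is an order-of-magnitude guide only.

```python
def effective_sample_size(n, rho):
    """Approximate effective sample size under AR(1) autocorrelation.

    Uses n * (1 - rho) / (1 + rho), the standard approximation for the
    information about a mean in an AR(1) series.
    """
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    return n * (1.0 - rho) / (1.0 + rho)


n_eff = effective_sample_size(13102, 0.6)
print(round(n_eff, 1))  # 3275.5
```

Under these assumptions the 13K bins carry roughly the information of 3,300 independent observations, which is still large relative to 14 smooth terms.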
Question 2: Modeling Results
What are our results telling us? Any concerns?
GAMM Results: Significant Predictors
| Term | EDF | p-value | Interpretation |
|---|---|---|---|
| hour_of_day | 8.24 | <0.001 | Strong diel pattern |
| ACI | 2.66 | <0.001 | Non-linear, positive |
| BI | 2.82 | <0.001 | Non-linear, negative |
| EAS | 3.08 | <0.001 | Non-linear |
| VARt | 2.94 | <0.001 | Non-linear |
| depth | 1.00 | <0.001 | Linear, negative |
Shrunk away (not significant): ADI, BioEnergy, EPS_KURT, MEANt
GAMM Smooth Plots: Overview
Smooth Zoom: hour_of_day
Observations:
- Strong diel pattern (EDF = 8.2)
- Peak activity ~8 PM (hour 20)
- Lowest ~10 AM (hour 10)
Validation:
- Matches known fish calling behavior
- Suggests the model is capturing a real biological signal
Smooth Zoom: BI (negative relationship)
Observation:
- Higher BI → less fish activity
- Counterintuitive?
Possible explanations:
- BI elevated when other sources dominate (snapping shrimp?)
- Fish call when BI is lower
- Seasonal confounding?
Question: Ecologically interpretable or artifact?
Smooth Zoom: VARt (non-linear)
Observation:
- “Goldilocks” relationship
- Fish activity peaks at intermediate VARt
- Drops at both extremes
Implication:
- This non-linear pattern is common in ecology
- GAMM smooths capture it naturally
Unexpected Results
Temperature:
- NOT significant in GAMM (p = 0.12)
Day of year:
- NOT significant in GAMM (p = 0.18)
- Despite visible seasonality in data →
Hypothesis: Acoustic indices absorb the seasonal/temperature signal?
Methodological Concerns
- 10 of 14 indices significant
  - Genuine signal or overfitting?
  - Large sample size (13K) means small effects can be significant
- AR1 autocorrelation
  - Currently using fixed rho = 0.6
  - Should we estimate rho from the data?
- Indices absorbing environmental signal?
  - Temperature and seasonality not significant
  - But indices vary with both; collinearity concern?
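On estimating rho from the data: one common heuristic is to fit the model without an AR1 term and take the lag-1 autocorrelation of its residuals as rho. The sketch below shows that calculation, pooled across stations so that no lag pair spans two stations. The function names are hypothetical, and the residuals are simulated only to check that the estimator recovers a known value.

```python
import random


def lag1_autocorrelation_pooled(residuals_by_station):
    """Lag-1 autocorrelation pooled over stations.

    Each value is a list of residuals in time order for one station. Lag
    pairs are formed within a station only, so the last bin of one station
    is never paired with the first bin of another.
    """
    numerator = 0.0
    denominator = 0.0
    for residuals in residuals_by_station.values():
        mean = sum(residuals) / len(residuals)
        centered = [r - mean for r in residuals]
        numerator += sum(a * b for a, b in zip(centered[1:], centered[:-1]))
        denominator += sum(c * c for c in centered)
    return numerator / denominator


def simulate_ar1(n, rho, seed):
    """Simulated AR(1) series with unit-variance innovations (test data only)."""
    rng = random.Random(seed)
    series = [rng.gauss(0.0, 1.0)]
    for _ in range(n - 1):
        series.append(rho * series[-1] + rng.gauss(0.0, 1.0))
    return series


# Synthetic residuals with a known rho of 0.6; station labels are made up.
simulated = {"A": simulate_ar1(5000, 0.6, seed=1),
             "B": simulate_ar1(5000, 0.6, seed=2)}
rho_hat = lag1_autocorrelation_pooled(simulated)
```

The estimate assumes evenly spaced bins with no gaps; missing 2-hour bins would need to be handled (for example, by starting a new series after each gap) before the value is passed to the model.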
Question 3: Validation Approach
How much validation do we need for a journal paper?
Inference vs Prediction: Options
| Approach | Description | Pros | Cons |
|---|---|---|---|
| Inference only | Full-data GAMM, report coefficients | Simpler, answers “are there relationships?” | Reviewers may question generalizability |
| Inference + light CV | Add leave-one-station-out validation | Shows relationships are robust | Slightly more work |
| Full predictive | Extensive CV, prediction metrics | Strongest for “applied monitoring” framing | Scope creep |
Our tentative plan: Inference + light CV
- Primary: Understanding which indices relate to community metrics
- Supporting: Simple CV showing relationships generalize
- Future work: Operational prediction applications
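The leave-one-station-out idea in the tentative plan amounts to the following split. This is a minimal sketch with made-up station labels; fitting and scoring the GAMM on each fold is left out.

```python
def leave_one_station_out(stations):
    """Yield (held_out_station, train_indices, test_indices) for each station.

    `stations` is a sequence with one station label per observation.
    """
    for held_out in sorted(set(stations)):
        train = [i for i, s in enumerate(stations) if s != held_out]
        test = [i for i, s in enumerate(stations) if s == held_out]
        yield held_out, train, test


# Hypothetical labels for six observations across three stations.
labels = ["A", "A", "B", "C", "C", "C"]
folds = list(leave_one_station_out(labels))
for held_out, train, test in folds:
    print(held_out, len(train), len(test))
```

With only three stations this yields three folds, so fold-to-fold variability will be high. Since station is also a random effect in the model, predictions for the held-out station would have to be made without its random-effect estimate.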
Summary: Questions for Discussion
Index reduction: Is correlation + VIF appropriate? Should we reduce further, or is model shrinkage sufficient?
Interpretation: Temperature/seasonality not significant — absorbed by indices, or a problem?
Validation: Is inference + light CV sufficient for publication? What kind of CV would reviewers expect?
Next steps: Expand to all 9 responses? Additional diagnostics?
Additional Context
- Pilot mode: Results shown are for fish_activity only; will expand to all 9 responses
- MEANt: No real variation in raw data (numerical noise ~10⁻¹⁹) — model correctly shrunk it away
- Repository: [link] — full specs, code, and data pipeline available